Roughly, it tells us how many examples (and how much computation)
we will need before we can learn a hypothesis $h$ that is
probably approximately correct.
PAC learning results are independent of the learning
algorithm used!
2.1 Concept Learning
Given:
Instances $X$: possible days, each described by the
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
Target function $c$: $EnjoySport: X \rightarrow \{0,1 \}$
Hypotheses $H$: Conjunctions of literals. E.g.
\[\langle ?, Cold, High, ?, ?, ? \rangle. \]
Training examples $D$: Positive and negative examples of the target function
\[\langle x_1, c(x_1) \rangle , \ldots \langle x_m, c(x_m) \rangle\]
Determine:
A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$
in $D$?
A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$ in $X$?
2.1.1 Sample Complexity
How many training examples are sufficient to learn the
target concept?
The examples could come in several ways:
The learner proposes instances, as queries to teacher:
Learner proposes instance $x$, teacher provides $c(x)$.
Play 20 questions, divide space in half.
The teacher (who knows $c$) provides training examples:
a sequence of examples of the form $\langle x, c(x) \rangle$.
The best strategy for choosing examples depends on the learner.
There is some random process that proposes the
instances: $x$ is generated randomly and the teacher provides
$c(x)$. Hmmmm, this case needs further analysis.
2.2 Problem Setting
Given a set of instances $X$, set of hypotheses $H$, set
of possible target concepts $C$, and training instances
generated by a fixed, unknown probability distribution
$\cal{D}$ over $X$
Learner observes a sequence $D$ of training examples of
form $\langle x, c(x)\rangle$, for some target concept $c \in
C$
The instances $x$ are drawn from distribution $\cal{D}$
and the teacher provides target value $c(x)$ for each.
Learner must output a hypothesis $h$ estimating $c$. $h$
is evaluated by its performance on subsequent instances drawn
according to $\cal{D}$
2.3 The Error of a Hypothesis
The true error (denoted $error_{\cal{D}}(h)$) of
hypothesis $h$ with respect to target concept $c$ and distribution
$\cal{D}$ is the probability that $h$ will misclassify an instance
drawn at random according to $\cal{D}$. \[error_{\cal{D}}(h) \equiv
\Pr_{x \in \cal{D}}[c(x) \neq h(x)] \]
That is, the probability mass of the instances on which $h$
and $c$ disagree.
The training error of $h$ with respect to $c$
is how often $h$ is wrong on the training data, i.e., the fraction
of the training examples it misclassifies:
\[error_{D}(h) \equiv \Pr_{x \in D}[c(x) \neq h(x)], \]
where the probability is taken over the training set $D$ rather than
the distribution $\cal{D}$.
We will start by assuming a training error of 0.
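As a concrete, illustrative sketch of these two quantities, assuming hypotheses, the target concept, and a sampler for $\cal{D}$ are represented as Python callables (all names here are hypothetical, not from the text):
\begin{verbatim}
# Illustrative sketch: training error on a finite sample D versus a
# Monte Carlo estimate of the true error under distribution D.
# (All names are hypothetical; h and c map an instance to 0 or 1.)

def training_error(h, D):
    """Fraction of labeled examples (x, c_x) in D that h misclassifies."""
    return sum(1 for x, c_x in D if h(x) != c_x) / len(D)

def estimated_true_error(h, c, sample_instance, trials=100000):
    """Monte Carlo estimate of Pr_{x ~ D}[ h(x) != c(x) ]."""
    mistakes = 0
    for _ in range(trials):
        x = sample_instance()      # x drawn according to distribution D
        mistakes += (h(x) != c(x))
    return mistakes / trials
\end{verbatim}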
2.4 PAC Learnability
Given a definition of error, we can now define PAC
learnability.
A concept class $C$ is
PAC-learnable by $L$ using
$H$ if for all $c \in C$, distributions $\cal{D}$ over $X$,
$\epsilon$ such that $0 < \epsilon < 1/2$, and
$\delta$ such that $0 < \delta < 1/2$, learner $L$
will with probability at least $(1-\delta)$ output a
hypothesis $h \in H$ such that $error_{\cal{D}}(h) \leq
\epsilon$, in time that is polynomial in $1/\epsilon$,
$1/\delta$, $n$, and $size(c)$, where $n$ is the size of instances
in $X$ and $size(c)$ is the encoding length of the target concept $c$.
Notice that even though the definition only mentions computation time,
computation time grows with the number of examples needed, so a
polynomial time bound implicitly bounds the number of examples as well.
So, we want to PAC-learn some concept, but when can we do it?
3 Sample Complexity for Finite Hypothesis Spaces
The growth in the number of required training examples
with problem size is called the sample complexity
of the learning problem.
We will consider only consistent learners,
which are those that maintain a training error of 0.
We can derive a bound on the number of training examples
required by any consistent learner!
Fact: Every consistent learner outputs a hypothesis
belonging to the version space.
Therefore, we need to bound the number of examples needed
to assure that the version space contains no unacceptable
hypothesis.
The version space $VS_{H,D}$ is said to be
ε-exhausted with respect to $c$ and
$\cal{D}$, if every hypothesis $h$ in $VS_{H,D}$ has error
less than ε with respect to $c$ and $\cal{D}$.
\[(\forall h \in VS_{H,D}) error_{\cal{D}}(h) < \epsilon \]
3.1 How Many Examples Needed To Exhaust VS?
If the hypothesis space $H$ is finite, and $D$ is a sequence of $m \geq 1$
independent random examples of some target concept $c$, then for any $0 \leq
\epsilon \leq 1$, the probability that the version space with respect to $H$
and $D$ is not ε-exhausted (with respect to $c$) is less than \[|H|
e^{-\epsilon m} \]
...see book for proof.
This bounds the probability that any consistent learner
will output a hypothesis $h$ with $error_{\cal{D}}(h) \geq \epsilon$.
If we want this probability to be below $\delta$,
\[|H|e^{- \epsilon m} \leq \delta \]
then
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(\frac{1}{\delta})) \]
So, any consistent learner can PAC-learn any
target concept in $H$ with $m$ examples, where PAC-learn means
that the hypothesis is probably ($1 - \delta$) approximately
(within error ε) correct.
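The bound follows by taking logarithms of $|H|e^{-\epsilon m} \leq \delta$ and solving for $m$. A minimal sketch of the resulting calculation (the function name is illustrative; the example value $|H| = 973$ is the size of the EnjoySport hypothesis space in Mitchell's textbook example):
\begin{verbatim}
from math import ceil, log

def consistent_sample_bound(H_size, epsilon, delta):
    """Number of examples sufficient so that, with probability at least
    1 - delta, every consistent hypothesis in a finite H has true error
    at most epsilon: m >= (ln|H| + ln(1/delta)) / epsilon."""
    return ceil((log(H_size) + log(1.0 / delta)) / epsilon)

# e.g. |H| = 973, epsilon = 0.1, delta = 0.05  ->  99 examples
print(consistent_sample_bound(973, 0.1, 0.05))
\end{verbatim}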
3.2 Agnostic Learning
So far, we have assumed that $c \in H$.
A learner that does not assume the target
concept is representable by $H$, and that simply finds the
hypothesis with minimum training error, is an agnostic
learner.
We want the hypothesis $h$ that makes fewest errors on
training data.
The sample complexity is
\[ m \geq \frac{1}{2 \epsilon^{2}}(\ln|H| + \ln(1/\delta)) \]
derived from Hoeffding bounds:
\[ \Pr[error_{\cal{D}}(h) > error_D(h) + \epsilon] \leq e^{-2m\epsilon^{2}} \]
where $D$ is a set containing $m$ randomly drawn examples and
$error_D(h)$ is the training error of $h$ on $D$.
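To see where this bound comes from (a standard derivation step, not spelled out above): applying the Hoeffding bound to each of the $|H|$ hypotheses, taking a union bound, and requiring the total failure probability to be at most $\delta$ gives
\[ |H| e^{-2m\epsilon^{2}} \leq \delta
\quad \Longrightarrow \quad
m \geq \frac{1}{2\epsilon^{2}}\left(\ln|H| + \ln\frac{1}{\delta}\right). \]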
3.3 Conjunctions of Boolean Literals are PAC-Learnable
Consider the class $C$ of target concepts described by
conjunctions of boolean literals, e.g. \[Old \wedge \neg
Tall.\]
Is it PAC-learnable? We need to show that any consistent
learner will require only a polynomial number of training
examples to learn any $c \in C$.
How many examples are sufficient to assure with
probability at least $(1 - \delta)$ that every $h$ in
$VS_{H,D}$ satisfies $error_{\cal{D}}(h) \leq \epsilon$?
Use our theorem:
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(\frac{1}{\delta})) \]
Suppose $H$ contains conjunctions of constraints on up to $n$ boolean
attributes (i.e., $n$ boolean literals). Then $|H| = 3^n$, and
\[m \geq \frac{1}{\epsilon}(\ln 3^n + \ln(\frac{1}{\delta})) \]
or
\[m \geq \frac{1}{\epsilon}(n \ln 3 + \ln(\frac{1}{\delta})) \]
So, $m$ grows only linearly in $n$ (and hence polynomially), as required.
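As an illustrative calculation (the numbers are chosen here only for concreteness): with $n = 10$ boolean attributes, $\epsilon = 0.1$, and $\delta = 0.05$,
\[ m \geq \frac{1}{0.1}\left(10 \ln 3 + \ln 20\right) \approx 10\,(10.99 + 3.00) \approx 140, \]
so roughly 140 training examples suffice.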
For example, Find-S is a consistent learner that uses time
linear in $n$ for each example. Therefore, the class of
conjunction of boolean literals is PAC-learnable by the Find-S
algorithm using $H = C$.
3.4 The Unbiased Concept Class is Not PAC-Learnable
Consider the unbiased concept class $C$ that contains every
teachable concept over $X$, i.e., every possible subset of $X$:
\[|C| = 2^{|X|}\]
Let the instances be defined by $n$ boolean features.
\[|X| = 2^n\] so \[|C| = 2^{2^n}\]
To learn this, the learner must use $H = C$.
We then have that
\[ m \geq \frac{1}{\epsilon}(2^n \ln 2 + \ln (\frac{1}{\delta}))
\]
so the number of examples needed according to this bound is exponential in $n$ (strictly speaking, since this is only an upper bound it does not prove the fact by itself, but the fact is true nonetheless).
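For concreteness (illustrative numbers): already with $n = 10$ boolean features the bound gives, for $\epsilon = 0.1$ and $\delta = 0.05$,
\[ m \geq \frac{1}{0.1}\left(2^{10} \ln 2 + \ln 20\right) \approx 10\,(709.8 + 3.0) \approx 7130, \]
and the bound roughly doubles with each additional feature.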
4 Sample Complexity for Infinite Hypothesis Spaces
So far, we have been using $|H|$ in our equation for
sample complexity, which only works for finite $H$.
It can also lead to weak (i.e., too loose/high)
bounds.
We now show how to handle infinite $H$ by using the
Vapnik-Chervonenkis dimension.
4.1 Shattering a Set of Instances
A dichotomy of a set $S$ is a partition of $S$ into two disjoint subsets.
There are $2^{|S|}$ possible dichotomies of an instance set $S$.
A set of instances $S$ is shattered by hypothesis
space $H$ if and only if for every dichotomy of $S$ there
exists some hypothesis in $H$ consistent with that
dichotomy.
The ability of $H$ to shatter a set of instances is a
measure of its capacity to represent target concepts over
them.
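A minimal brute-force sketch of checking whether a finite set of hypotheses shatters a given set of instances, using a toy class of threshold functions on the real line (the class and all names are chosen here purely for illustration):
\begin{verbatim}
from itertools import product

def make_threshold(t):
    # Toy hypothesis: h_t(x) = 1 iff x >= t (illustrative choice).
    return lambda x: int(x >= t)

def shatters(hypotheses, S):
    """True iff every dichotomy (labeling) of S is realized by some hypothesis."""
    for labeling in product([0, 1], repeat=len(S)):
        if not any(all(h(x) == y for x, y in zip(S, labeling))
                   for h in hypotheses):
            return False
    return True

H = [make_threshold(t) for t in (-1.0, 0.5, 1.5, 99.0)]
print(shatters(H, [1.0]))        # True: a single point can be shattered
print(shatters(H, [1.0, 2.0]))   # False: no threshold labels 1.0 positive
                                 # and 2.0 negative
\end{verbatim}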
4.2 The Vapnik-Chervonenkis Dimension
An unbiased hypothesis space $H$ is one that shatters the entire
instance space $X$ (and remember, unbiased is bad).
But, what if $H$ can instead shatter a large subset of
$X$?
The Vapnik-Chervonenkis dimension, $VC(H)$, of
hypothesis space $H$ defined over instance space $X$ is the
size of the largest finite subset of $X$ shattered by $H$. If
arbitrarily large finite sets of $X$ can be shattered by $H$,
then $VC(H) \equiv \infty$.
For any finite $H$, if $VC(H) = d$ then $H$ must contain at least
$2^d$ distinct hypotheses (one for each dichotomy of a shattered set
of size $d$), so
\[ |H| \geq 2^{VC(H)} \quad\mbox{and hence}\quad VC(H) \leq \log_2|H|. \]
4.3 VC Dimension of Linear Decision Surfaces
The VC dimension of linear decision surfaces on the plane
is 3.
Here $H$ is the set of all lines in the plane, each classifying points by which side of the line they fall on.
Every set of 2 distinct points can be shattered. Some sets of 3
points (any 3 non-collinear points) can be shattered, so $VC(H) \geq 3$.
No set of 4 points can be shattered (e.g., for 4 points in convex
position, the labeling that assigns the same class to opposite corners
cannot be realized by any line), so $VC(H) = 3$.
4.4 Sample Complexity from VC Dimension
How many randomly drawn examples suffice to
$\epsilon$-exhaust $VS_{H,D}$ with probability at least $(1 -
\delta)$?
\[
m \geq \frac{1}{\epsilon}(4\log_2(\frac{2}{\delta}) + 8 VC(H) \log_2 (\frac{13}{\epsilon}))
\]
Similar to before but we replaced the $\ln |H|$ with
$VC(H)$ (recall that $VC(H) \leq \log_2|H|$).
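As an illustrative calculation (numbers chosen only for concreteness): for linear decision surfaces in the plane we have $VC(H) = 3$, so with $\epsilon = 0.1$ and $\delta = 0.05$,
\[ m \geq \frac{1}{0.1}\left(4\log_2 40 + 24\log_2 130\right) \approx 10\,(21.3 + 168.5) \approx 1900. \]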
We can also provide a lower bound on sample
complexity. If $VC(C) \geq 2$, $0 < \epsilon <
\frac{1}{8}$, and $0 < \delta < \frac{1}{100}$, then
there exists a distribution $\cal{D}$ and a target concept in $C$
such that if $L$ observes fewer examples than
\[
\max \left( \frac{1}{\epsilon}\log(\frac{1}{\delta}), \frac{VC(C)-1}{32\epsilon} \right)
\]
then with probability at least $\delta$, $L$ outputs a hypothesis $h$
having $error_{\cal{D}}(h) > \epsilon$.
Note that this lower bound is reasonably close to the upper bound above.
4.5 VC Dimension for Neural Nets
Define $C_G$ to be the G-composition of
$C$—the set of all functions that can be implemented by
network $G$ assuming units take on functions from class
$C$.
$C$ is the concept class represented by each neuron.
If $G$ is a layered directed acyclic graph (DAG) with $n$ input nodes
and $s \geq 2$ internal nodes, each having at most $r$ inputs, and
$VC(C) = d$, then
\[ VC(C_G) \leq 2ds \cdot \log(es) \]
We can use this to bound the number of training examples
sufficient to learn (with prob. at least $1-\delta$) any
target concept from $C_G$ to within error ε.
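As an illustrative instantiation (not worked out above): if each internal unit is a linear threshold unit with at most $r$ real-valued inputs, then $d = VC(C) = r+1$, so for a network with $s = 20$ such units, each with at most $r = 10$ inputs,
\[ VC(C_G) \leq 2ds\log(es) = 2 \cdot 11 \cdot 20 \cdot \log(20e) = 440\,\log(20e), \]
and this value can be substituted for $VC(H)$ in the sample-complexity bound above.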
5 Mistake Bound Model of Learning
Computational learning theory studies other models (other than PAC) in which
the order of the training examples is varied,
there is noise in the data,
the definition of success is different,
the learner makes different assumptions about the distribution of instances, etc.
We will now look at the mistake bound model of learning in which the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis.
5.1 Mistake Bound Model: Find-S
Instances drawn at random from $X$ according to
distribution $\cal{D}$
Learner must classify each instance before receiving correct
classification from teacher.
Consider Find-S when $H=$ conjunction of boolean
literals.
Initialize $h$ to the most specific hypothesis $l_{1} \wedge \neg l_{1} \wedge l_{2} \wedge \neg l_{2}
\wedge \ldots \wedge l_{n} \wedge \neg l_{n}$
For each positive training instance $x$, remove from $h$ any literal that is not satisfied by $x$
Output hypothesis $h$.
How many mistakes are made before converging to the correct $h$?
At most $n+1$: the first mistake (necessarily on a positive example,
since the initial hypothesis classifies everything as negative) removes
$n$ of the initial $2n$ literals, and each subsequent mistake removes
at least one more.
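A minimal sketch of this mistake-bound version of Find-S over $n$ boolean attributes (the representation and all names are illustrative): the hypothesis is kept as a set of literals, instances as tuples of 0/1 values, and mistakes can only occur on positive examples.
\begin{verbatim}
# Find-S in the mistake-bound setting over n boolean attributes.
# A literal is a pair (index, value); h is the conjunction of its literals.
# (Representation chosen only for illustration.)

def initial_hypothesis(n):
    # Most specific hypothesis: l_i and not-l_i for every attribute i.
    return {(i, v) for i in range(n) for v in (0, 1)}

def predict(h, x):
    return int(all(x[i] == v for (i, v) in h))

def find_s_online(examples, n):
    h, mistakes = initial_hypothesis(n), 0
    for x, label in examples:
        if predict(h, x) != label:
            mistakes += 1
        if label == 1:
            # Keep only the literals satisfied by the positive example x.
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h, mistakes

# Hypothetical target concept: x[0] AND NOT x[2], with n = 3 attributes.
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
print(find_s_online(data, 3))    # makes 2 mistakes here; the bound is n+1 = 4
\end{verbatim}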
5.2 Mistake Bound Model: Halving Algorithm
Consider the Halving Algorithm
Learn the concept using the version-space
Candidate-Elimination algorithm.
Classify new instances by majority vote of version space members.
How many mistakes before converging to correct $h$?
At most $\log_2|H|$: every mistake eliminates at least half
of the hypotheses from the version space (the majority that voted
incorrectly), and the version space starts with $|H|$ hypotheses.
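A minimal sketch of the Halving Algorithm over an explicitly enumerated finite $H$ (explicit enumeration is only feasible for small hypothesis spaces; the toy class and all names are illustrative):
\begin{verbatim}
# Halving Algorithm: keep the version space explicitly, predict by majority
# vote, and after each example discard every inconsistent hypothesis.

def halving(hypotheses, examples):
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in examples:
        votes = sum(h(x) for h in version_space)
        prediction = int(2 * votes >= len(version_space))  # majority (ties -> 1)
        if prediction != label:
            mistakes += 1
        # Candidate-Elimination step: keep only consistent hypotheses.
        version_space = [h for h in version_space if h(x) == label]
    return mistakes

# Toy H: threshold concepts over {0,...,7}; hypothetical target threshold 5.
H = [lambda x, t=t: int(x >= t) for t in range(9)]
data = [(x, int(x >= 5)) for x in (3, 7, 4, 6, 5, 0)]
print(halving(H, data))    # mistakes made; at most log2|H|, here at most 3
\end{verbatim}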
5.3 Optimal Mistake Bounds
Let $M_{A}(C)$ be the maximum number of mistakes made by algorithm $A$ to learn
concepts in $C$ (the maximum is taken over all possible $c \in C$ and all
possible training sequences):
\[M_{A}(C) \equiv \max_{c \in C} M_{A}(c)\]
Let $C$ be an arbitrary non-empty concept class. The
optimal mistake bound for $C$, denoted $Opt(C)$, is the minimum
over all possible learning algorithms $A$ of $M_{A}(C)$. \[Opt(C)
\equiv \min_{A \in \mbox{learning algorithms}} M_{A}(C) \]
We can relate this to the VC dimension with
\[ VC(C) \leq Opt(C) \leq M_{Halving}(C) \leq \log_{2}|C|. \]
6 Summary
Research in Computational Learning Theory has led to a
better understanding of what is learnable and how fast.
It has also led to the development and improvement of
learning algorithms.
Like the other areas we have studied, it is an active research area
with its own conference (COLT) [6].